Kyle MacLaughlin's Resume Projects

Summary

Here I have compiled my implementations of a few statistical and machine learning tasks into one Jupyter notebook, applying common methods to datasets from a variety of fields.

Table of Contents

  1. Detecting Tumors With A Deep Convolutional Network
  2. A Recommender System For Netflix Movies
  3. Sentiment Analysis In Product Reviews

Detecting Tumors With A Deep Convolutional Network

Background

One challenge in modern medicine is identifying tumors and other anomalies in brain MR scans. Increasingly sophisticated systems for identifying tumors (and other visible abnormalities) are being developed to aid radiologists in their meticulous searches. In this section, I will create a deep convolutional neural network to classify scans as positive (tumorous) or negative (typical).

Implementation Details

To create this model, I downloaded the Br35H :: Brain Tumor Detection 2020 image dataset published on Kaggle (https://www.kaggle.com/datasets/ahmedhamada0/brain-tumor-detection) by Ahmed Hamada. The dataset contains two folders labelled "yes" (positive) and "no" (negative), each holding 1,500 axial slices taken from MR scans. I segment two-thirds of the data (2,000 images) into a training set and place the rest in the test set. Since only one model is used (for simplicity), no validation set is required.
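The two-thirds split can be sketched as follows. The helper below is illustrative (a reproducibly shuffled split over a list of labelled file paths), not necessarily the notebook's exact code:

```python
import random

def train_test_split(items, train_frac=2 / 3, seed=0):
    """Shuffle a list reproducibly and split it into train/test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split stable
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

# With the 3,000 labelled Br35H images, this yields 2,000 train / 1,000 test.
```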

Preparing the data
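Typical preparation for a convolutional network is to scale pixel values into [0, 1] and one-hot encode the two class labels. A minimal sketch of that step (resizing and any augmentation are omitted, and the function name is my own):

```python
import numpy as np

def to_model_inputs(images, labels, num_classes=2):
    """Scale uint8 pixel arrays to [0, 1] and one-hot encode integer labels."""
    x = np.asarray(images, dtype=np.float32) / 255.0
    y = np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]
    return x, y
```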

Building & training the model

When we finally build our model, we identify increasingly high-level features with a series of convolutional layers, apply a sharp dimensionality reduction, and feed the result into our output layer. The output layer consists of two units with a softmax activation for our one-hot class labels, accompanied by the matching categorical cross-entropy loss function. For our dimensionality-reduction layers, we include regularization penalties to ensure that parameters do not grow too large.
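For a two-unit softmax output trained against one-hot labels, the cross-entropy loss being minimized can be written out directly in NumPy. This is an illustrative sketch of the output stage, not the framework code used in the notebook:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, logits):
    """Mean cross-entropy between one-hot labels and softmax probabilities."""
    p = softmax(logits)
    return float(-np.mean(np.sum(y_true * np.log(p + 1e-12), axis=-1)))
```

Confident, correct logits drive this loss toward zero, while uninformative (equal) logits give ln 2 ≈ 0.693 for two classes.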

Testing the model

Now that we have constructed our model and fitted it to the training set, we will put it to the test! We will compute the balanced error rate (BER) of the model on the test set. This is the mean of the false positive and false negative rates, and it provides a metric for how well the model generalizes to unseen data.
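The BER computation itself is straightforward. A sketch assuming binary 0/1 labels and predictions:

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Mean of the false positive rate and the false negative rate."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fpr = np.mean(y_pred[y_true == 0] == 1)  # negatives wrongly flagged
    fnr = np.mean(y_pred[y_true == 1] == 0)  # positives missed
    return float((fpr + fnr) / 2)
```

Because each class contributes equally, BER is not inflated by class imbalance the way raw accuracy can be.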

We observe a BER of 0.09, meaning the model's false positive and false negative rates average 9%. While this result is not state of the art, it could certainly be useful for flagging potential tumors for radiologists to examine, especially with a tweak to the classification threshold.

A Recommender System For Netflix Movies :popcorn::dvd:

Background

In 2006, Netflix issued a challenge that was hard (for us plebeians) to ignore: a $1 million prize for any team whose movie recommender system, built on data collected by the streaming service, could beat their in-house Cinematch algorithm's error score by at least 10%. Although the competition ended more than a decade ago, I have decided to put my skills to the test by developing a recommender system of my own using the same dataset.

Implementation Details

I downloaded the official Netflix Prize dataset published by Netflix on Kaggle (https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data). In this example, I aim to beat Cinematch's RMSE (root-mean-square error) of 0.9525, the benchmark against which the contest's 10% improvement target was measured.
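The target metric is easy to pin down. A minimal implementation of RMSE over predicted versus actual ratings (illustrative, not the notebook's evaluation pipeline):

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Any model scoring below 0.9525 on held-out ratings beats Cinematch.
```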

Sentiment Analysis In Product Reviews

Background

It is often useful to learn more about users' preferences than their binary or numerical ratings of a product or service alone reveal. For example, a small or mid-sized business might collect data from its customers, but those data might be too sparse to explain why consumers make purchases or leave reviews. In such cases, it is useful to identify the underlying attitude expressed in their written feedback, referred to as sentiment.